ABSTRACT

DNA nanotechnology often requires collections of oligonucleotides called "DNA free energy
gap codes" that do not produce erroneous crosshybridizations in a competitive multiplexing
environment. This paper addresses the question of how to design these codes to accomplish
a desired amount of work within an acceptable error rate. Using a statistical thermodynamic
and probabilistic model of DNA code fidelity and mathematical random coding theory
methods, theoretical lower bounds on the size of DNA codes are given. More importantly,
DNA code design parameters (e.g., strand number, strand length and sequence composition)
needed to achieve experimental goals are identified.

Key  words:  biomolecular  computing,  crosshybridization,  DNA  barcodes,  DNA  codes,  DNA 
computing,  DNA  words,  hybridization,  nearest  neighbor,  random  coding  methods,  self  assembly, 
stacked  pairs,  statistical  thermodynamics,  SynDCode,  tag-antitag  systems,  universal  arrays. 

1.  INTRODUCTION 

DNA  NANOTECHNOLOGY  often  requires  collections  of  oligonucleotides  that  do  not  produce  erroneous 
crosshybridizations.  When  these  collections  consist  of  complementary  pairs  of  oligonucleotides,  i.e., 
are  closed  under  complementation,  they  are  called  DNA  tag-antitag  systems  (Kaderali  et  al.,  2003)  and  DNA 
codes (D’yachkov et al., 2003, 2005b, 2006). When the collections need not be closed under
complementation, they are called DNA words (Andronescu et al., 2003; Tulpan et al., 2005; Shortreed et al., 2005) and
DNA barcodes (Eason et al., 2004). These collections of non-crosshybridizing strands have applications
in  SNP  multiplexing  (Cai  et  al.,  2000;  Kaderali  et  al.,  2003;  Fish  et  al.,  2007),  gene  function  identification 
(Eason  et  al.,  2004),  nanostructure  self-assembly  (Valignat  et  al.,  2005),  universal  microarrays  (Hardenbol 
et  al.,  2003),  and  biomolecular  computing  (Braich  et  al.,  2002;  Rose  et  al.,  2004).  Combinatorial,  heuristic, 
and  biological  methods  have  been  suggested  as  a  means  by  which  DNA  codes  can  be  found  and  programs 
exist  that  generate  DNA  codes  (Andronescu  et  al.,  2003;  Bishop  et  al.,  2006;  Chen  et  al.,  2006;  Tulpan 
et  al.,  2005;  D’yachkov  et  al.,  2006;  Penchovsky  and  Ackermann,  2003). 

In several papers (Zhang et al., 2005; Tulpan et al., 2005; Dirks et al., 2004, 2007; Rose et al., 1999, 2004;
Horne et al., 2006), statistical thermodynamics is applied to model competitive multiplexing hybridization.
However, the methods there are primarily numerical in nature and do not provide detailed information
about how to design collections of non-crosshybridizing strands to accomplish a desired amount of work
within an acceptable error rate. This paper addresses exactly this question and answers it by theoretical
means rather than by heuristic, numerical or algorithmic ones. Using a statistical thermodynamic and
probabilistic model of DNA code fidelity coupled with mathematical random coding theory methods similar
to those presented in D’yachkov et al. (2005b, 2003), theoretical lower bounds on the size of DNA codes
are given. More importantly, DNA code design parameters, e.g., strand number, strand length and sequence
composition, needed to achieve experimental self-assembly goals are identified.

Single strands of DNA are represented by (A, C, G, T)-quaternary sequences that are oriented, either
5' → 3' or 3' → 5'. In this paper, single stranded DNA molecules without an indicated direction are
assumed to be in the 5' → 3' direction. The reverse-complement of a DNA strand is defined by first
reversing the order of the letters and then substituting each letter with its complement, A for T, C for G
and vice-versa. For example, the reverse complement of AACGTG is CACGTT. Henceforth, complement
means reverse-complement unless otherwise stated. For strand $x$, let $\bar{x}$ denote its complement. A (perfect)
Watson-Crick duplex is the joining of complement sequences in opposite orientations so that every base
of one strand is paired with its complementary base on the other strand in the double helix structure,
i.e., $x$ and $\bar{x}$ are "perfectly compatible." However, when two, not necessarily complementary, oppositely
directed DNA strands are "sufficiently compatible," they too are capable of coalescing into a double
stranded DNA duplex. The process of forming DNA duplexes from single strands is referred to as DNA
hybridization. Crosshybridization occurs when two oppositely directed and non-complementary DNA strands
form a duplex. Crosshybridization does not always occur, but there is a potential for it to happen. In general,
crosshybridization is undesirable as it usually leads to experimental error. To increase the accuracy and
throughput of the applications listed above, there is a desire to have collections of DNA strands, as large
and as mutually incompatible as possible, so that no crosshybridization can take place.
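
The reverse-complement convention above can be illustrated with a minimal Python sketch. The function names are illustrative assumptions and are not taken from the paper or from any DNA code software.

```python
# Watson-Crick base complements.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse-complement of a strand written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def is_self_complementary(strand: str) -> bool:
    """True when a strand equals its own reverse-complement."""
    return strand == reverse_complement(strand)


print(reverse_complement("AACGTG"))   # the example used in the text
print(is_self_complementary("ACGT"))
```

With this convention, a strand and its reverse-complement are the two members of a complementary pair, and a self-complementary strand is one that the second helper flags.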

Definition 1. Given two DNA strands $x$ and $y$, we let $x : y$ denote the DNA duplex formed between $x$
and $y$. It is implicitly assumed that $x$ and $y$ are oppositely oriented in $x : y$ with the first strand $x$ always
assumed to be in the 5' → 3' direction and the second $y$ always assumed to be in the 3' → 5' direction.
A crosshybridized (CH) duplex is an $x : y$, where $y \ne \bar{x}$.

Even though it is possible for complementary sequences to form a non-perfectly aligned duplex, we
will call any $x : \bar{x}$ duplex a Watson-Crick (WC) duplex. Two oppositely directed copies of a single strand
$x$ can form $x : x$, which is a CH duplex if $x$ is not self-complementary. An example of a self-complementary
strand is $x = ACGT = \bar{x}$. In the discussion below, self-complementary strands are largely forbidden.


2.  STACKED  PAIRS  AND  UNSTACKED  2-STRINGS  IN 
SECONDARY  STRUCTURES 

Let $x = x_1, \ldots, x_n$ and $y = y_1, \ldots, y_n$ be DNA sequences. For a base $y_j$, let $\bar{y}_j$ be its complement
base. Then $\bar{y} = \bar{y}_n, \ldots, \bar{y}_1$.

Definition 2. Suppose $1 \le i_r, j_r \le n$. A secondary structure of the DNA duplex $x : y$ is a sequence of
pairs of complementary bases $p = (x_{i_r}, y_{n+1-j_r})$ where $x_{i_r} = \bar{y}_{n+1-j_r}$ and $(x_{i_r})$ and $(y_{n+1-j_r})$ are increasing
and decreasing subsequences of $x$ and $y$ respectively. Given a secondary structure $p = (x_{i_r}, y_{n+1-j_r})$,
a stacked pair in a duplex is a pair of consecutively aligned complementary bases, $x_{i_r} = \bar{y}_{n+1-j_r}$,
$x_{i_{r+1}} = \bar{y}_{n+1-j_{r+1}}$, in $p$ where $i_{r+1} = i_r + 1$ and $j_{r+1} = j_r + 1$. The notation $x_{i_r}x_{i_{r+1}} / y_{n+1-j_r}y_{n+1-j_{r+1}}$
is used to denote a stacked pair. An unstacked 2-string of $x$ in a secondary structure $p = (x_{i_r}, y_{n+1-j_r})$ is
a 2-string, $x_i x_{i+1}$, of $x$ that is not part of any stacked pair in $p$.

Clearly, $x : y$ can have many secondary structures, so stacked pairs and unstacked 2-strings must
be defined relative to a given secondary structure.

FIG. 1. An example of a secondary structure in a DNA duplex.


Example 1. The secondary structure in Figure 1 has stacked pairs

$A_4G_5/T_{11}C_{10}$, $G_5T_6/C_{10}A_9$, $T_6T_7/A_9A_8$, $T_9A_{10}/A_7T_6$, $C_{11}C_{12}/G_3G_2$

where the subscripts indicate the position of the bases in the 5' → 3' direction. Since

$A_4G_5$, $G_5T_6$, $T_6T_7$, $T_9A_{10}$, $C_{11}C_{12}$

are the 5' → 3' bases in stacked pairs in the exhibited secondary structure, the unstacked 2-strings
in $x$ are

$C_1C_2$, $C_2C_3$, $C_3A_4$, $T_7T_8$, $T_8T_9$, $A_{10}C_{11}$, $C_{12}C_{13}$, $C_{13}C_{14}$.

Definition 3. Given two DNA strands $x$ and $y$, let $S(x, y)$ denote the maximum number of stacked pairs,
and let $U(x, y)$ denote the minimum number of unstacked 2-strings in $x$, over all secondary structures
between $x$ and $y$.

A dynamic programming method to compute $S(x, y)$ is given in D’yachkov et al. (2005a, 2006). It is
implemented in the DNA code software SynDCode, which is available at Bishop et al. (2006).
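
The following is a minimal dynamic programming sketch for $S(x, y)$, written directly from Definitions 2 and 3. It is not the algorithm of D’yachkov et al. or the SynDCode implementation; the function names and the two-table recursion are assumptions made for illustration. It uses the observation that pairing $x_i$ with a base of $y$ is the same as matching $x_i$ with a letter of the reverse-complement of $y$, so a secondary structure is a common subsequence and a stacked pair is two matches that are adjacent in both strands.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    return "".join(COMPLEMENT[b] for b in reversed(strand))


def max_stacked_pairs(x, y):
    """Maximum number of stacked pairs over all secondary structures of x : y.

    Both strands are given 5'->3'. With z = reverse_complement(y), a secondary
    structure is a common subsequence of x and z, and a stacked pair is a pair
    of matches (i, j), (i + 1, j + 1).
    """
    z = reverse_complement(y)
    n, m = len(x), len(z)
    neg = float("-inf")
    # best[i][j]: best count using only the prefixes x[:i] and z[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # end[i][j]: best count when x[i-1] is paired with z[j-1] (neg if impossible).
    end = [[neg] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == z[j - 1]:
                # Either follow any earlier structure, or extend a stack by one.
                end[i][j] = max(best[i - 1][j - 1], end[i - 1][j - 1] + 1)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


x = "AACGTG"
print(max_stacked_pairs(x, reverse_complement(x)))  # WC duplex: n - 1 stacked pairs
print(max_stacked_pairs("AAAA", "AAAA"))            # no complementary bases at all
```

The sketch runs in time proportional to the product of the strand lengths; the minimum number of unstacked 2-strings then follows as $n - 1 - S(x, y)$ for a strand $x$ of length $n$.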

Definition 4. Let $L$ be a collection of 2-strings of DNA bases closed under complementation, e.g.,
$L = \{AA, TT, AT, TA\}$ or $\{AT\}$. A DNA sequence $x = x_1, \ldots, x_n$ is called an $L$ sequence of length $n$ if
$x_i x_{i+1} \notin L$ for each $1 \le i \le n - 1$. Let $DNA(n, L)$ denote the set of $L$ sequences of length $n$. The cardinality
of $DNA(n, L)$ is denoted by $\Lambda_{n,L}$ or just $\Lambda_n$ when the context is clear.

Throughout this paper, $k, n, s, u$ and $N$ denote positive integers and, whenever $x \in DNA(n, L)$ is
selected, it is assumed that each such $x$ is equally likely.

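Proposition 1. Let $L = \{AA, TT, AT, TA\}$, then |INS-A: BEGIN |

$$\Lambda_{n,L} = \frac{5-3\sqrt{5}}{10}\,(1-\sqrt{5})^{n} + \frac{5+3\sqrt{5}}{10}\,(1+\sqrt{5})^{n} \tag{2.1}$$

and

$$\Lambda_{n,L} \le 4\,(1+\sqrt{5})^{n-1}. \tag{2.2}$$

Proof. A sequence in $DNA(n, L)$ can have at most $\lfloor (n+1)/2 \rfloor$ of the letters A or T and no two of these can
be consecutive. So given $x \in DNA(n, L)$, suppose the number of letters A or T it contains is $k$ where
$0 \le k \le \lfloor (n+1)/2 \rfloor$. There are $2^k$ different ways to arrange these. Then between any two of the letters A or T at
least one G or C must be inserted. By a classical combinatorial "objects in boxes" type argument, there
are

$$\binom{(n-k-(k-1)) + (k+1) - 1}{n-k-(k-1)}\, 2^{n-k} = \binom{n-k+1}{k}\, 2^{n-k}$$

ways to insert the $n - k$ letters G or C. Thus

$$\Lambda_{n,L} = \sum_{k=0}^{\lfloor (n+1)/2 \rfloor} 2^{k}\binom{n-k+1}{k}\, 2^{n-k} = 2^{n} \sum_{k=0}^{\lfloor (n+1)/2 \rfloor} \binom{n-k+1}{k}.$$

It is known that

$$\sum_{k=0}^{\lfloor (n+1)/2 \rfloor} \binom{n-k+1}{k} = F(n),$$

where $F(n)$ is recursively defined by $F(n) = F(n-1) + F(n-2)$ and $F(1) = 2$ and $F(2) = 3$. By
solving the Fibonacci recurrence relation with the given initial conditions,

$$F(n) = \frac{5-3\sqrt{5}}{10}\left(\frac{1-\sqrt{5}}{2}\right)^{n} + \frac{5+3\sqrt{5}}{10}\left(\frac{1+\sqrt{5}}{2}\right)^{n}.$$

From this, (2.1) follows. Since

$$\Lambda_{n,L} = \frac{5-3\sqrt{5}}{10}\,(1-\sqrt{5})^{n} + \frac{5+3\sqrt{5}}{10}\,(1+\sqrt{5})^{n}$$

$$\le \left[\frac{3\sqrt{5}-5}{10}\left(\frac{\sqrt{5}-1}{1+\sqrt{5}}\right)^{n} + \frac{5+3\sqrt{5}}{10}\right](1+\sqrt{5})^{n}$$

$$\le \left[\frac{3\sqrt{5}-5}{10}\left(\frac{\sqrt{5}-1}{1+\sqrt{5}}\right) + \frac{5+3\sqrt{5}}{10}\right](1+\sqrt{5})^{n} = 4\,(1+\sqrt{5})^{n-1},$$

(2.2) is established.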
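
Proposition 1 can be checked numerically for small $n$. The sketch below is an illustration only: it enumerates $L$ sequences by brute force and compares the count with the closed form (2.1) and the bound (2.2). The two-term recurrence used in `count_recurrence` is not stated in the text; it is an assumption obtained by multiplying the Fibonacci recurrence for $F(n)$ by $2^n$.

```python
from itertools import product
from math import sqrt

FORBIDDEN = {"AA", "TT", "AT", "TA"}  # the set L of Proposition 1


def count_brute_force(n):
    """Count length-n sequences containing no 2-string from FORBIDDEN."""
    total = 0
    for letters in product("ACGT", repeat=n):
        if all(a + b not in FORBIDDEN for a, b in zip(letters, letters[1:])):
            total += 1
    return total


def count_recurrence(n):
    """Same count via t(n) = 2 t(n-1) + 4 t(n-2), t(1) = 4, t(2) = 12 (derived)."""
    a, b = 4, 12
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, 2 * b + 4 * a
    return b


def closed_form(n):
    """Right-hand side of (2.1), evaluated in floating point."""
    r5 = sqrt(5)
    return ((5 - 3 * r5) / 10) * (1 - r5) ** n + ((5 + 3 * r5) / 10) * (1 + r5) ** n


for n in range(1, 9):
    exact = count_brute_force(n)
    assert exact == count_recurrence(n) == round(closed_form(n))
    assert exact <= 4 * (1 + sqrt(5)) ** (n - 1) + 1e-9  # bound (2.2)
    print(n, exact)
```

For $n = 1$ the bound (2.2) holds with equality, since all four single letters are $L$ sequences.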


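Definition 5. For $x, y, z \in DNA(n, L)$ with $x \ne y$, let

$B_{n,L,s}(x) = \{y : S(x, y) \ge s\}$,

$A_{n,L,s} = \{z : S(z, z) \ge s\}$.

Proposition 2. Let $L = \{AA, TT, AT, TA\}$ and $x, y, z \in DNA(n, L)$.

a. $$|B_{n,\emptyset,s}(x)| \le \sum_{j=1}^{\min(s,\,n-s)} \binom{s-1}{j-1}\binom{n-s}{j}^{2}\, 4^{\,n-s-j}.$$

b. $$|B_{n,L,s}(x)| \le \sum_{j=1}^{\min(s,\,n-s)} \binom{s-1}{j-1}\binom{n-s}{j}^{2} \min\left(4^{\,n-s-j},\; 4^{\,j+1}(1+\sqrt{5})^{\,n-s-2j-1}\right).$$
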

min(.s,n— .?) 

c.  \An,0'S\  <  E  (m.jKV)4" 

7  =  1  1  1 


s+j 


s~\~j  even 
min(.y,n—  s) 


d •  \An,L,s\  <  J-j) Cy 0  min  (!  +  ^5) 


2n—s—4j—2  > 
2 


j+7  even 
Proof. Proposition 2a appears as Proposition 12 in (D’yachkov et al., 2006) for $L = \emptyset$. A relatively
straightforward modification of the proof there yields Proposition 2b here. Also see (D’yachkov et al.,
2005b) where a proof of Proposition 2c appears.

To show Proposition 2b here, let $X, Y$ be binary sequences of length $n$ over {0, 1} and {0, 2} respectively.
There are $\binom{s-1}{j-1}\binom{n-s}{j}^{2}$ pairs $X$ and $Y$ such that each has $s + j$ 0s with the 0s partitioned into $j$ substrings
so that each substring has at least two 0s and these 0s are partitioned in exactly the same way in $X$ and
$Y$. For example, $X$ = 000, 1, 00, 11, 0000 and $Y$ = 2, 000, 00, 2, 0000, 2 where the common partition of
the 0s is 000, 00, 0000.

Fix $x \in DNA(n, L)$ and suppose $y \in DNA(n, L)$ has $S(x, y) \ge s$. By the arguments in (D’yachkov et al.,
2005a) and (D’yachkov et al., 2005b), for some $1 \le j \le \min(s, n - s)$ there is a common subsequence
$(x_{i_k}) = (y_{j_k})$ of length $s + j$ of $x$ and $y$ respectively that can be obtained from a pair $X, Y$ above by
taking $(x_{i_k})$, $(y_{j_k})$ to be the subsequences in $x$ and $y$ that correspond to the positions of the 0s in $X$, $Y$
respectively.

Since $x$ is fixed, place $y$ in class $X, Y$ if the required common subsequence was obtained from this pair.
The number of different $y$ in a given class is the number of {A, C, G, T} sequences that can arise from
a $y \in DNA(n, L)$ when it is restricted to the positions of the 2s in $Y$. The number of such subsequences
that can arise is at most

$$\Lambda_{b_1,L} \cdot \Lambda_{b_2,L} \cdots \Lambda_{b_{j+1},L}$$

where

$$\sum_{i=1}^{j+1} b_i = n - s - j \quad \text{for } b_i \ge 0 \text{ and } \Lambda_{0,L} = 1.$$

From (2.2)

$$\Lambda_{b_1,L} \cdot \Lambda_{b_2,L} \cdots \Lambda_{b_{j+1},L} \le \min\left(4^{\,n-s-j},\; 4^{\,j+1}(1+\sqrt{5})^{\,n-s-2j-1}\right)$$

and Proposition 2b then follows.

To  show  Proposition  2d  here,  let  Z  be  a  binary  sequence  of  length  n  over  {0,  1}.  There  are 


such $Z$ that have $s + j$ 0s with $s + j$ even and with the 0s symmetrically partitioned into $j$ substrings
so that each substring has at least two 0s and the first $\frac{s+j}{2}$ 0s are arranged as the mirror image of the last
$\frac{s+j}{2}$ 0s. For example, $Z$ = 11, 000, 1, 000, 00, 111, 000, 000, 1 where the symmetric partition of the
0s is 000, 000, 0|0, 000, 000.

Let $z \in DNA(n, L)$ and suppose $S(z, z) \ge s$. By the arguments in (D’yachkov et al., 2005b), for some
$1 \le j \le \min(s, n - s)$, there is a $Z$ as described above such that the positions of the 0s in $Z$ correspond to
the positions of a self-complement subsequence of length $s + j$ in $z$. Since a self-complement subsequence
of length $s + j$ is determined by its first $\frac{s+j}{2}$ entries, it follows that there are at most


min  4" 


_*+i 


l7  + 


m 


+i 


(i  +  Vs) 


(»-^)-(7+[41+0n 


2  n—s—j 
4  2  . 


^l(l  +  V5) 


2n — s — 4/ — 2  ' 
2 


z  in  each  class  Z  because  there  are  at  j  +  1  blocks  of  1  ,s  and 


blocks  of  0.5  for  which  there  is  choice 


to  place  substrings  of  DNA(bj,  L)  to  construct  sequences  that  capture  all  possible  z  in  class  Z. 


The  following  Corollary  1  is  now  trivial. 
Corollary 1. Let $0 \le s \le n - 1$. Suppose $L = \{AA, TT, AT, TA\}$. Select $x, y \in DNA(n, L)$ with $x \ne y$.
Then

a. $$\Pr(S(x, y) \ge s) \le \frac{1}{\Lambda_n} \sum_{j=1}^{\min(s,\,n-s)} \binom{s-1}{j-1}\binom{n-s}{j}^{2} \min\left(4^{\,n-s-j},\; 4^{\,j+1}(1+\sqrt{5})^{\,n-s-2j-1}\right).$$

b.



Pr(S(x,  x) 


„  min  (5  ,n—s) 

1  ^  .  LiJ 


>*<1:  E 


2  n—s—j 

2  , 4 


(l  +  Vs)" 

m(1+v5)[^J' 


j+y  even 


Proof.  Part  a  follows  from  Proposition  2b.  Part  b  follows  from  Proposition  2d.  ■ 

Definition  6.  Let  U(x,  y)  denote  the  minimum  number  of  unstacked  2-strings  in  x  over  all  secondary 
structures  between  x  and  y. 


IlNS-A:  END  | 


If $x$ and $y$ have a maximum number of $s$ stacked pairs over all secondary structures, then there must
be a minimum number of $n - s - 1$ unstacked 2-strings among the $x_1x_2, \ldots, x_{n-1}x_n$ that are not stacked.
Thus, $U(x, y) \le u$ if and only if $S(x, y) \ge n - u - 1$. So the next Corollary 2 follows from Corollary 1.
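
As a sanity check of this count, the numbers of Example 1 can be used. This is only an illustration for the single secondary structure of Figure 1, which is not claimed to be optimal: $x$ has $n = 14$ bases, hence 13 2-strings, of which 5 are stacked and 8 are unstacked.

```latex
\[
  \underbrace{13}_{n-1}
  \;=\; \underbrace{5}_{\text{stacked}} \;+\; \underbrace{8}_{\text{unstacked}},
  \qquad\text{so}\qquad
  S(x,y) \ge 5
  \quad\text{and}\quad
  U(x,y) \le 8 = n - 5 - 1 .
\]
```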


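Corollary 2. Let $0 \le u \le n - 1$. Suppose $L = \{AA, TT, AT, TA\}$. Select $x, y \in DNA(n, L)$ with $x \ne y$.  |INS-B: BEGIN |
Then

a. $$\Pr(U(x, y) \le u) \le F_1(u, n) \tag{2.3}$$

where

$$F_1(u, n) = \frac{1}{\Lambda_n} \sum_{j=1}^{\min(u+1,\,n-u-1)} \binom{n-u-2}{j-1}\binom{u+1}{j}^{2} \min\left(4^{\,u+1-j},\; 4^{\,j+1}(1+\sqrt{5})^{\,u-2j}\right).$$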


b. $$\Pr(U(x, x) \le u) \le F_2(u, n) \tag{2.4}$$

where


.  min(n  +  l,n—  u— 1)  ...  /  r„ 

F2{u,n)  =  —  J2  (  [7^!  )C'T1)min(42±i^-4^^  (1  +  ^5) 


n-\-u— 4j  —  1  ^ 
2  _ 


x”  v  171" 

n—  u— l+y  even 


IlNS-B:  END  | 


3.  BOUNDS  ON  NEAREST  NEIGHBOR  THERMODYNAMICS 

Stacked pairs play a special role in the Nearest Neighbor (NN) model of DNA duplex thermodynamics
(SantaLucia, 1998; Zuker et al., 1999). Briefly, local thermodynamic functions $\Delta H$, $\Delta S$, which are
essentially independent of temperature $T$, are experimentally found for stacked pairs and other secondary
structure motifs and then are used in an additive fashion to predict global thermodynamic values for
duplexes. Free energy, $\Delta G$, at a given temperature $T$ is derived from $\Delta H$, $\Delta S$ by

$$\Delta G = \Delta H - T \Delta S \tag{3.1}$$

One example of these local functions for stacked pairs is given in Table 1 (SantaLucia, 1998). A
demonstration of how these local functions are used to make global predictions is given in Example 2.

Table 1. Nearest Neighbor Thermodynamic Values for Stacked Pairs

Stacked pairs          ΔH           ΔS              ΔG at 310 K
5'→3'/3'→5'            (kcal/mol)   (cal/(K·mol))   (kcal/mol)

AA/TT = TT/AA           -7.9        -22.2           -1.02
AC/TG = GT/CA           -8.4        -22.4           -1.46
AG/TC = CT/GA           -7.8        -21.0           -1.29
AT/TA                   -7.2        -20.4           -0.88
CA/GT = TG/AC           -8.5        -22.7           -1.46
CC/GG = GG/CC           -8.0        -19.9           -1.83
CG/GC                  -10.6        -27.2           -2.17
GA/CT = TC/AG           -8.2        -22.2           -1.32
GC/CG                   -9.8        -24.4           -2.24
TA/AT                   -7.2        -21.3           -0.60


Example 2. The $\Delta H$, $\Delta S$ of the WC duplex $x : \bar{x}$ = AGTCA : TCAGT predicted by the NN model is
computed by essentially summing associated Table 1 values for the duplex's stacked pairs,

AG/TC, GT/CA, TC/AG, CA/GT,

then adding a constant initiation penalty IP. Thus for AGTCA : TCAGT,

$\Delta H$(duplex) = -7.8 - 8.4 - 8.2 - 8.5 + $IP_1$ = -32.9 + $IP_1$ kcal/mol

$\Delta S$(duplex) = -0.0210 - 0.0224 - 0.0222 - 0.0227 + $IP_2$ = -0.0883 + $IP_2$ kcal/(K·mol).

From these computed values, the free energy for the WC duplex is given by (3.1) and thus for the duplex
at 310 K,

$\Delta G$(duplex) = -5.53 + $IP_1$ + $IP_2$ kcal/mol.

Another way to accomplish the same task is to first compute the $\Delta G$ for each stacked pair at a given
temperature, sum these values and add $IP = IP_1 + IP_2$. The $\Delta G$ for stacked pairs at 310 K is given in
the last column of Table 1.
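
The additive computation of Example 2 can be sketched in a few lines of Python. The dictionary below transcribes the $\Delta H$ and $\Delta S$ columns of Table 1 keyed by the 5' → 3' 2-string; the function name is an illustrative assumption, and the initiation penalty is deliberately left out because its numerical parts are not given in this section.

```python
# Table 1: 5'->3' 2-string -> (dH in kcal/mol, dS in cal/(K mol)).
NN = {
    "AA": (-7.9, -22.2), "TT": (-7.9, -22.2),
    "AC": (-8.4, -22.4), "GT": (-8.4, -22.4),
    "AG": (-7.8, -21.0), "CT": (-7.8, -21.0),
    "AT": (-7.2, -20.4),
    "CA": (-8.5, -22.7), "TG": (-8.5, -22.7),
    "CC": (-8.0, -19.9), "GG": (-8.0, -19.9),
    "CG": (-10.6, -27.2),
    "GA": (-8.2, -22.2), "TC": (-8.2, -22.2),
    "GC": (-9.8, -24.4),
    "TA": (-7.2, -21.3),
}


def wc_stacking_terms(x, temperature=310.0):
    """Sum Table 1 values over the stacked pairs of the WC duplex of x.

    Returns (dH, dS, dG) in kcal/mol, cal/(K mol) and kcal/mol, without
    any initiation penalty.
    """
    two_strings = [x[i:i + 2] for i in range(len(x) - 1)]
    dh = sum(NN[s][0] for s in two_strings)
    ds = sum(NN[s][1] for s in two_strings)
    dg = dh - temperature * ds / 1000.0  # equation (3.1), cal converted to kcal
    return dh, ds, dg


dh, ds, dg = wc_stacking_terms("AGTCA")
print(round(dh, 1), round(ds, 1), round(dg, 2))
```

Summing the last column of Table 1 for the same four stacked pairs gives the same stacking contribution up to rounding, which is the second route described in the example.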

Predictions about the thermodynamic stability of CH duplexes with a given secondary structure can
also be made from the stacked pairs that it contains. In D’yachkov et al. (2005a, 2006), it is argued that
the $\Delta G$ for a CH duplex is bounded below by the sum of all the free energies of the stacked pairs that it
contains plus IP. Note that a more negative $\Delta G$ means a more stable duplex. The claim is strongly supported by
comparing this thermodynamically weighted stacked pair sum computed by SynDCode (Bishop et al.,
2006) to computations made by NN minimum free energy software Pairfold (Andronescu et al., 2003) and
RNAstructure (Mathews et al., 2006).

Example 3. The secondary structure in Figure 1 has stacked pairs

$A_4G_5/T_{11}C_{10}$, $G_5T_6/C_{10}A_9$, $T_6T_7/A_9A_8$, $T_9A_{10}/A_7T_6$, $C_{11}C_{12}/G_3G_2$.

Hence at 310 K and using values from Table 1, the $\Delta G$ of the duplex with the indicated secondary structure
is bounded below by


-1.29  -  1.46  -  1.02  -  1.60  -  1.83  =  -7.20  +  IP  kcal/mol. 


Thus using the standard IP = 1.96, SynDCode (Bishop et al., 2006) gives $\Delta G$(duplex) = -6.24 kcal/mol.
Pairfold (Andronescu et al., 2003) and RNAstructure (Mathews et al., 2006) respectively give -2.48 and -2.70 kcal/mol.

4. RELATIVE STABILITY AND PROBABILITY VIA
STATISTICAL THERMODYNAMICS

One question is: How does one measure relative stability of DNA duplexes? For example, given only $N$
possible secondary structures $p_1, \ldots, p_N$ for a duplex with free energy $\Delta G_1, \ldots, \Delta G_N$ respectively where
each $\Delta G_i \le 0$, how likely is the duplex to have $p_i$? Thinking of the secondary structures as states, a
statistical thermodynamic partition function argument can be applied. Let $\omega_i$ be some real-valued function
of $\Delta G_i$. Then the partition function $Q$ is given by:

$$Q = \sum_{i=1}^{N} \omega_i. \tag{4.1}$$

Using $Q$, the probability that the duplex has secondary structure $p_i$ is given by

$$\Pr(p_i) = \frac{\omega_i}{Q}. \tag{4.2}$$

Typically, $\omega_i = \exp(|\Delta G_i|/RT)$ where $R$ = 0.0019872 kcal/(K·mol) is the gas constant and $T$ is
temperature in kelvin. Following this thread:

$$\Pr(p_1) = \frac{\exp(|\Delta G_1|/RT)}{\sum_{i=1}^{N} \exp(|\Delta G_i|/RT)} = \frac{1}{1 + \sum_{i=2}^{N} \exp\left(-(|\Delta G_1| - |\Delta G_i|)/RT\right)}. \tag{4.3}$$

If for all $i \ne 1$, $|\Delta G_1| - |\Delta G_i| \ge g \ge 0$, then

$$\Pr(p_1) \ge \frac{1}{1 + (N - 1)\exp(-g/RT)}. \tag{4.4}$$

This leads to the use of the minimum free energy gap as a measure of performance. This is explored in
the next section (Tulpan et al., 2005; Penchovsky and Ackermann, 2003; Shortreed et al., 2005).
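
A small numerical sketch of the partition function argument follows. The free energies in the usage lines are hypothetical values chosen only for illustration, and the function names are assumptions; the formulas are (4.1), (4.2) and (4.4) with the weight $\exp(|\Delta G_i|/RT)$.

```python
from math import exp

R = 0.0019872  # gas constant in kcal/(K mol), as in the text


def structure_probabilities(delta_g, temperature=310.0):
    """Probabilities (4.2) with weights exp(|dG_i| / RT)."""
    weights = [exp(abs(g) / (R * temperature)) for g in delta_g]
    q = sum(weights)  # partition function (4.1)
    return [w / q for w in weights]


def gap_lower_bound(n_states, gap, temperature=310.0):
    """Right-hand side of (4.4) for n_states structures and a free energy gap."""
    return 1.0 / (1.0 + (n_states - 1) * exp(-gap / (R * temperature)))


# Hypothetical free energies in kcal/mol; the first structure is the most stable.
energies = [-9.0, -5.5, -5.0, -4.0]
probs = structure_probabilities(energies)
gap = min(abs(energies[0]) - abs(g) for g in energies[1:])
assert probs[0] >= gap_lower_bound(len(energies), gap)
print(probs[0], gap_lower_bound(len(energies), gap))
```

The bound depends only on the number of competing structures and on the smallest gap, which is why the minimum free energy gap is a convenient design target.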


5. STATISTICAL THERMODYNAMIC SIGNIFICANCE
OF THE FREE ENERGY GAP


Since each possible stacked pair can be identified with its 2-string in the 5' → 3' strand (usually $x$) in
the duplex, the absolute values of thermodynamic parameters in Table 1 are a function of 2-strings of bases.
These positive functions are given in Table 2 where $\Delta H$, $\Delta S$ are renamed as $f_H$ and $f_S$ respectively.


Table 2. Positive Thermodynamic Functions on Two Strings

Two-string
5'→3'          f_H      f_S      f_{G,310}

AA = TT         7.9     22.2     1.02
AC = GT         8.4     22.4     1.46
AG = CT         7.8     21.0     1.29
AT              7.2     20.4     0.88
CA = TG         8.5     22.7     1.46
CC = GG         8.0     19.9     1.83
CG             10.6     27.2     2.17
GA = TC         8.2     22.2     1.32
GC              9.8     24.4     2.24
TA              7.2     21.3     0.60

The associated absolute value of the free energy for stacked pairs, $f_{G,T}$, at a given temperature $T$ can
be computed by using the following version of the classical equation (3.1):

$$f_{G,T} = f_H - T \cdot f_S. \tag{5.1}$$

For example, $f_{G,310}$ is given in the last column of Table 2. Note that each of the functions $f_H$, $f_S$ and
$f_{G,T}$ has the property that $f(x_i x_{i+1}) = f(\bar{x}_{i+1}\bar{x}_i)$, i.e., they are invariant under complementation.

Henceforth $f_H$, $f_S$ will be arbitrary positive functions on 2-strings invariant under complementation
and $f_{G,T}$ will be assumed to have been derived from such via (5.1).

Definition 7. Given distinct and non-self-complementary strands $x, y$ in $DNA(n, L)$, consider the WC
duplex $x : \bar{x}$ and the CH duplexes, $x : x$, $x : y$. Let $\|x, \bar{x}\|$ denote the absolute value of the $\Delta G$ of the
WC duplex $x : \bar{x}$ in perfect alignment and let $\|x, x\|$ and $\|x, y\|$ denote the absolute values of the $\Delta G$ of
the most stable secondary structures for the $x : x$, $x : y$ CH duplexes. The quantity

$$\|x, \bar{x}\| - \|x, y\|$$

is plainly referred to as the asymmetric free energy gap.

Given distinct and non-self-complementary strands $x_1, x_2, \ldots, x_N$ in $DNA(n, L)$, think of strand $x_1$
having $N + 1$ possible states. It can either form a WC duplex with its complement $\bar{x}_1$ or it can form a
CH duplex with one of the $N$ strands $x_1, x_2, \ldots, x_N$ (itself included as there may be multiple copies
of each strand). Let $x = x_1$. Then following the partition function argument given in (4.3) and (4.4), the
probability that $x$ forms a WC duplex is

$$\Pr(x : \bar{x}) = \frac{\exp(\|x, \bar{x}\|/RT)}{\exp(\|x, \bar{x}\|/RT) + \exp(\|x, x\|/RT) + \sum_{i=2}^{N} \exp(\|x, x_i\|/RT)}. \tag{5.2}$$

Now suppose that for $i \ne 1$, $\|x, \bar{x}\| - \|x, x\|$ and $\|x, \bar{x}\| - \|x, x_i\|$ are all at least $g \ge 0$. Then by (4.4)
and (5.2),

$$\Pr(x : \bar{x}) \ge \frac{1}{1 + (N + 1)\exp(-g/RT)}. \tag{5.3}$$

Suppose $C = \{\{x_i, \bar{x}_i\}\}_{i=1}^{N}$ is a collection of $N$ complementary pairs of strands in $DNA(n, L)$ of
multiplicity 2, such that the $2N$ strand types are distinct and thus no strand type is self-complementary. Then
there are a total of $4N$ strands in solution. Suppose further that $\|x_i, \bar{x}_i\| - \|x_i, y\| \ge g \ge 0$ for all $y = x_j$
or $y = \bar{x}_j$ where $j \ne i$. Then for any strand $x_i$:

$$\Pr(x_i : \bar{x}_i) = \Pr(\bar{x}_i : x_i) \ge \frac{1}{1 + (2N - 1)\exp(-g/RT)}. \tag{5.4}$$

Under the reasonable assumption that the formation of the $x_i : \bar{x}_i$ duplex is independent of (or does not reduce
the probability of) the formation of the $x_j : \bar{x}_j$ duplex, then:

Given the entire collection $C$ of $4N$ strands, the probability that $2N$ WC duplexes form so that there are
no CH duplexes is at least

$$\left(\frac{1}{1 + (2N - 1)\exp(-g/RT)}\right)^{2N}. \tag{5.5}$$

If (5.5) is to be at least $\alpha$ where $0 < \alpha < 1$, then solving for $g$ in terms of $\alpha$ and $N$ yields

$$g = -RT \ln\left(\frac{\alpha^{-1/2N} - 1}{2N - 1}\right). \tag{5.6}$$
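
The relation between the gap $g$, the number of pairs $N$ and the target probability in (5.5) and (5.6) can be explored numerically. The sketch below is illustrative only: the target value 0.99, the temperature of 310 K and the function names are assumptions, not values prescribed by the text.

```python
from math import exp, log

R = 0.0019872  # gas constant in kcal/(K mol)


def all_wc_probability_bound(gap, n_pairs, temperature=310.0):
    """Lower bound (5.5) on the probability that all 2N WC duplexes form."""
    per_strand = 1.0 / (1.0 + (2 * n_pairs - 1) * exp(-gap / (R * temperature)))
    return per_strand ** (2 * n_pairs)


def required_gap(alpha, n_pairs, temperature=310.0):
    """Gap g from (5.6) at which the bound (5.5) equals alpha."""
    ratio = (alpha ** (-1.0 / (2 * n_pairs)) - 1.0) / (2 * n_pairs - 1)
    return -R * temperature * log(ratio)


for n_pairs in (10, 100, 1000):
    g = required_gap(0.99, n_pairs)
    print(n_pairs, round(g, 2), round(all_wc_probability_bound(g, n_pairs), 4))
```

Because $g$ enters only through $\exp(-g/RT)$, the required gap grows roughly logarithmically with $N$, which the loop above makes visible.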
Definition 8. Given $f_{G,T}$ and any complementary pair $x : \bar{x}$, define

$$\|x, \bar{x}\|_{f_{G,T}} = \sum_{i=1}^{n-1} f_{G,T}(x_i x_{i+1}). \tag{5.7}$$

Suppose $x$ and $y$ are non-complementary strands. Given a secondary structure $p = (x_{i_r}, y_{n+1-j_r})$ between
$x$ and $y$, let $A(p)$ be the stacked pairs in $p$. Define

$$\|x, y\|_{f_{G,T},\,p} = \sum_{x_{i_r}x_{i_r+1} \in A(p)} f_{G,T}(x_{i_r} x_{i_r+1}). \tag{5.8}$$

$$\|x, y\|_{f_{G,T}} = \max_{p}\left(\|x, y\|_{f_{G,T},\,p}\right). \tag{5.9}$$

$$\gamma_{f_{G,T}}(x, y) = \|x, \bar{x}\|_{f_{G,T}} - \|x, y\|_{f_{G,T}}. \tag{5.10}$$

$\gamma_{f_{G,T}}(x, y)$ is called the asymmetric stacked pair free energy gap. $\gamma$ is written for $\gamma_{f_{G,T}}$ when the context
is clear.

Proposition 3. Let $x, y \in DNA(n, L)$. Suppose $0 \le c \le f_{G,T}(x_i x_{i+1})$ for all $x_i x_{i+1} \notin L$. Then
$\gamma_{f_{G,T}}(x, y) \ge U(x, y) \cdot c$.

Proof. Let $p = (x_{i_r}, y_{n+1-j_r})$ be a secondary structure between $x$ and $y$, and let $B_x(p)$ be the unstacked
2-strings in $x$ relative to $p$. From (5.7)-(5.11), it follows that

$$\gamma_{f_{G,T}}(x, y) = \sum_{x_k x_{k+1} \in B_x(p)} f_{G,T}(x_k x_{k+1})$$

for some $p$. Since for every $p$, $|B_x(p)| \ge U(x, y)$, the result follows. ■
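
Definition 8 and Proposition 3 can be illustrated with a weighted variant of a stacked pair dynamic program. The sketch below is an assumption-laden illustration, not the SynDCode implementation: it takes $f_{G,310}$ from Table 2, sets $L = \emptyset$ so that $c$ is the smallest table value, and uses hypothetical strands in the usage lines.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# f_{G,310} from Table 2 (kcal/mol), keyed by the 5'->3' 2-string.
F_G310 = {
    "AA": 1.02, "TT": 1.02, "AC": 1.46, "GT": 1.46, "AG": 1.29, "CT": 1.29,
    "AT": 0.88, "CA": 1.46, "TG": 1.46, "CC": 1.83, "GG": 1.83, "CG": 2.17,
    "GA": 1.32, "TC": 1.32, "GC": 2.24, "TA": 0.60,
}


def reverse_complement(s):
    return "".join(COMPLEMENT[b] for b in reversed(s))


def best_structure(x, y, weight):
    """Max over secondary structures of x : y of the summed stacked pair weights."""
    z = reverse_complement(y)
    n, m = len(x), len(z)
    neg = float("-inf")
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    end = [[neg] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == z[j - 1]:
                stack = neg
                if i > 1:  # extend a stack ending at (i - 1, j - 1)
                    stack = end[i - 1][j - 1] + weight(x[i - 2:i])
                end[i][j] = max(best[i - 1][j - 1], stack)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


def gamma(x, y):
    """Asymmetric stacked pair free energy gap (5.10) with f = F_G310."""
    wc = sum(F_G310[x[i:i + 2]] for i in range(len(x) - 1))  # (5.7)
    return wc - best_structure(x, y, F_G310.__getitem__)     # (5.9)


def min_unstacked(x, y):
    """U(x, y) = n - 1 - S(x, y), with S from the same DP using unit weights."""
    return len(x) - 1 - round(best_structure(x, y, lambda _: 1.0))


x, y = "CGCGTCGC", "GCCGAGCG"  # hypothetical strands
c = min(F_G310.values())       # smallest weight, valid when L is empty
assert gamma(x, y) >= min_unstacked(x, y) * c - 1e-9  # Proposition 3
assert abs(gamma(x, reverse_complement(x))) < 1e-9     # gap of the WC duplex is 0
print(gamma(x, y), min_unstacked(x, y))
```

Using the same recursion with unit weights recovers the stacked pair count, so both sides of the inequality in Proposition 3 come from one routine.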

Note $\gamma(x, \bar{x}) = 0$ and

$$\|x, y\|_{f_{G,T}} = \|y, x\|_{f_{G,T}}, \qquad \|x, y\|_{f_{G,T}} = \|\bar{x}, \bar{y}\|_{f_{G,T}},$$

so

$$\gamma(x, y) = \gamma(\bar{x}, \bar{y}).$$

However, since in general

$$\|x, \bar{x}\|_{f_{G,T}} \ne \|y, \bar{y}\|_{f_{G,T}},$$

then

$$\gamma(x, y) \ne \gamma(y, x)$$

and this is why the term "asymmetric" is used.

In D’yachkov et al. (2005a, 2006), it is discussed how $\|x, y\|_{f_{G,T}} - IP$ is an upper bound on $\|x, y\|$
when $f_H$, $f_S$ are as given in Table 2. The NN model gives that $\|x, \bar{x}\| = \|x, \bar{x}\|_{f_{G,T}} - IP$, so it follows
that

$$\gamma(x, y) \le \|x, \bar{x}\| - \|x, y\|. \tag{5.11}$$

Thus the asymmetric stacked pair free energy gap is a lower bound for the asymmetric free energy gap.

Definition 9 is central to this paper.

Definition 9. Let $f_{G,T}$ be as given in equation (5.1) for some $f_H$, $f_S$. A $DNA_{f_{G,T}}(n, L, g)$ code $C$ is
a collection of complementary pairs in $DNA(n, L)$ such that no strand is self-complementary and for any
two non-complementary strands $x, y$ in $C$:

$$\gamma_{f_{G,T}}(x, y) \ge g.$$

$DNA_{f_{G,T}}(n, L, g)$ codes are also called free energy gap DNA codes.

The DNA code software SynDCode (Bishop et al., 2006) generates $DNA_{f_{G,T}}(n, L, g)$ codes $C$ with
many additional and optional user added sequence constraints (see Example 7 and Conclusion). It should
also be noted that for $x, y \in C$, since $\gamma(x, y) \ge g$ and $\gamma(y, x) \ge g$,

$$\min(\|x, \bar{x}\|, \|y, \bar{y}\|) - \|x, y\| \ge g. \tag{5.12}$$

Thus $DNA_{f_{G,T}}(n, L, g)$ codes are nearly of the type discussed in Tulpan et al. (2005), Penchovsky and
Ackermann (2003), and Shortreed et al. (2005), in which (5.12) is nearly one of the main constraints
(see Conclusion).


6. RANDOM CODING BOUND FOR HIGH FIDELITY
$DNA_{f_{G,T}}(n, L, g)$ CODES

Equations (5.6) and (5.11) provide a relationship between the free energy gap and the probability of
correct self-assembly.

Definition 10. Suppose only the strands of a $DNA_{f_{G,T}}(n, L, g)$ code $C$ are present in solution in equal
concentrations. For $0 < \alpha < 1$, $C$ self-assembles with fidelity $\alpha$ if $\alpha$ is the probability that every
strand in $C$ forms a WC duplex.

If the model of having exactly two copies of the strands of a $DNA_{f_{G,T}}(n, L, g)$ code $C$ is assumed to be
reasonable for strands in equal concentrations, then from equations (5.6) and (5.11), a $DNA_{f_{G,T}}(n, L, g)$ code $C$ with
$N$ pairs has fidelity at least $\alpha$ if, for any two non-complementary strands $x, y$ in $C$, the asymmetric stacked pair
free energy gap $\gamma(x, y)$ satisfies:

$$\gamma(x, y) \ge -RT \ln\left(\frac{\alpha^{-1/2N} - 1}{2N - 1}\right) = g. \tag{6.1}$$

Below a random coding theory method is used to get a lower bound on the number of complementary
pairs $N$ of a $DNA_{f_{G,T}}(n, L, g)$ code.

Example  4.  If  C  is  a  DNA  fGT(n,  L,  g)  with  N  pairs  of  multiplicity  2  where  g  is  given  by  (6.1)  when 
a  =  1  —  then  out  of  the  desired  2 N  WC  duplexes,  an  expected  number  of  one  WC  duplex  does  not 
form.  In  Figure  2,  a  sufficient  stacked  pair  free  energy  gap  g(a ,  N)  with  a  =  1  —  is  plotted  against 
logw(N). 


Definition 11. Let $\mathcal{U} = \{\{x, \bar{x}\} : x, \bar{x} \in DNA(n, L)\}$. An $E \subseteq \mathcal{U}$ is called a random $DNA(n, L)$ $k$-set
of pairs if $|E| = k$ and the uniform distribution is on the $k$-sets of $\mathcal{U}$.

For the remainder of this paper $E$ is assumed to be a random $DNA(n, L)$ $k$-set of pairs.


FREE  ENERGY  GAP  AND  DNA  CODES 




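Definition 12. For a real number $g > 0$, we say that a complementary pair $\{x_i, \bar{x}_i\}$ is $\gamma_{f_{G,T}}$ $g$-bad in
$E$ if for any other $\{x_j, \bar{x}_j\}$ in $E$ either:

$$\gamma_{f_{G,T}}(x_i, x_i) = \gamma_{f_{G,T}}(\bar{x}_i, \bar{x}_i) < g, \tag{6.2}$$

$$\gamma_{f_{G,T}}(x_i, x_j) = \gamma_{f_{G,T}}(\bar{x}_i, \bar{x}_j) < g, \tag{6.3}$$

$$\gamma_{f_{G,T}}(x_i, \bar{x}_j) = \gamma_{f_{G,T}}(\bar{x}_i, x_j) < g. \tag{6.4}$$

A complementary pair $\{x_i, \bar{x}_i\}$ is called $\gamma_{f_{G,T}}$ $g$-good in $E$ if it is not $\gamma_{f_{G,T}}$ $g$-bad.

The following is obvious:

Proposition 4. The collection of $\gamma_{f_{G,T}}$ $g$-good pairs in $E$ is a $DNA_{f_{G,T}}(n, L, g)$ code.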


Lemma 1. Let $x, y \in DNA(n, L)$ be randomly selected without replacement. Then there exists a |INS-C: BEGIN |
$DNA_{f_{G,T}}(n, L, g)$ code with $N$ complementary pairs of strands from $DNA(n, L)$ if

$$\Pr(\gamma(x, x) < g) + (2N - 1)\Pr(\gamma(x, y) < g) \le \frac{1}{4}. \tag{6.5}$$

Proof. Let $p_g(x, y)$ be the probability that either $\gamma(x, y) < g$ or $\gamma(x, \bar{y}) < g$. Since $\Pr(\gamma(x, y) < g) = \Pr(\gamma(x, \bar{y}) < g)$, then

$$p_g(x, y) \leq 2\Pr(\gamma(x, y) < g). \qquad (6.6)$$

Let

$$p_g(x, x) = \Pr(\gamma(x, x) < g). \qquad (6.7)$$

Let $|E| = 2N$. From the additive bound of the probability of the union of events, it follows that the
probability that $\{x_i, \bar{x}_i\}$ is $\gamma$ $g$-bad in $E$ is at most

$$p_g(x, x) + (2N - 1)\,p_g(x, y). \qquad (6.8)$$

Thus the probability that $\{x_i, \bar{x}_i\}$ is good in $E$ is at least

$$1 - \left(p_g(x, x) + (2N - 1)\,p_g(x, y)\right). \qquad (6.9)$$

The main point of all of this is that the expected number of good pairs in $E$ is at least

$$2N\left(1 - \left(p_g(x, x) + (2N - 1)\,p_g(x, y)\right)\right), \qquad (6.10)$$

and (6.10) will be at least $N$ when

$$p_g(x, x) + (2N - 1)\,p_g(x, y) \leq \frac{1}{2}. \qquad (6.11)$$

Thus given (6.11), there must exist an $E$ that contains $N$ $\gamma_{f_{G,T}}$ $g$-good pairs. Hence the result follows
from Proposition 4. ■
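The counting step in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the sample probabilities are hypothetical, and `p_self` and `p_cross` stand for the quantities written $p_g(x, x)$ and $p_g(x, y)$ in the proof.

```python
def bad_bound(p_self, p_cross, n_pairs):
    """Union bound (6.8) on the probability that a pair is g-bad in E, |E| = 2N."""
    return p_self + (2 * n_pairs - 1) * p_cross


def expected_good_pairs(p_self, p_cross, n_pairs):
    """Lower bound (6.10) on the expected number of g-good pairs in E."""
    return 2 * n_pairs * (1.0 - bad_bound(p_self, p_cross, n_pairs))


def code_exists(p_self, p_cross, n_pairs):
    """Condition (6.11): if the bound is at most 1/2, some E holds N good pairs."""
    return bad_bound(p_self, p_cross, n_pairs) <= 0.5


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to exercise the arithmetic.
    n = 1000
    print(bad_bound(0.1, 1e-4, n))
    print(expected_good_pairs(0.1, 1e-4, n))
    print(code_exists(0.1, 1e-4, n))
```

Because the expected number of good pairs is an average over all choices of $E$, at least one $E$ must meet or exceed it, which is why the threshold 1/2 suffices.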

Proposition 5. Given $f_{G,T}$, suppose $0 \leq c \leq f_{G,T}(x_i x_{i+1})$ for $x_i x_{i+1} \notin L$. Let $g > 0$ and suppose that $g/c$
is not an integer. Then for distinct $x, y \in \mathrm{DNA}(n, L)$:

a. $\Pr(\gamma_{f_{G,T}}(x, y) < g) \leq F_1(\lfloor g/c \rfloor, n)$. (6.12)

b. $\Pr(\gamma_{f_{G,T}}(x, x) < g) \leq F_2(\lfloor g/c \rfloor, n)$. (6.13)

Proof. Applying Proposition 3, it follows that

$$\Pr(\gamma_{f_{G,T}}(x, y) < g) \leq \Pr(U(x, y) \leq \lfloor g/c \rfloor). \qquad (6.14)$$

The result follows from Corollary 2. ■

Theorem 1. Let $L = \{AA, TT, AT, TA\}$. Given $f_{G,T}$, suppose $0 \leq c \leq f_{G,T}(x_i x_{i+1})$ for all $x_i x_{i+1} \notin L$. Let
$g > 0$ and suppose that $g/c$ is not an integer. Let $F_1(\lfloor g/c \rfloor, n)$ and $F_2(\lfloor g/c \rfloor, n)$ be as given in Proposition 5.

Define $\mathrm{Bad}_L(N, n, g, c)$ to be:

$$\mathrm{Bad}_L(N, n, g, c) = F_2(\lfloor g/c \rfloor, n) + (2N - 1)\,F_1(\lfloor g/c \rfloor, n). \qquad (6.15)$$

If $\mathrm{Bad}_L(N, n, g, c) \leq \frac{1}{2}$, then there exists a $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ code with $N$ complementary pairs.

Proof. Apply Lemma 1 and Propositions 4 and 5. ■


7.  RESULTS 

In Examples 5-7 below, let $L_1 = \{AA, TT, AT, TA\}$ and $L_2 = \emptyset$. The probabilities for the $L_2$ case
can be obtained by using Proposition 2 parts a and c. Given $f_{G,T}$ in Table 2, the appropriate
values for $c$ in Theorem 1 are $c_1 = 1.29$ for $L_1$ and $c_2 = 0.60$ for $L_2$. As in Example 4, for a given $N$, which
denotes the number of pairs in a desired code, let $\alpha_N = 1 - \frac{1}{2N}$ and let $g_N = g(\alpha_N, N)$ be given by (5.6)
when $T = 310$ K.
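To make the dependence of $g_N$ on $N$ concrete, here is a minimal sketch that evaluates a sufficient gap at $T = 310$ K. It rests on two assumptions that are not confirmed by the text shown here: that the threshold has the form $g = -RT \ln\left((\alpha^{-1/(2N)} - 1)/(2N - 1)\right)$, and that energies are in kcal/mol so that $R \approx 1.987 \times 10^{-3}$ kcal/(mol K). Under these assumptions the sketch gives a value close to the $g_N = 14.05$ quoted in Example 5 for $N = 1000$. The function names are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K); the unit choice is an assumption


def fidelity_for(n_pairs):
    """alpha_N = 1 - 1/(2N), as used in Examples 4 and 5."""
    return 1.0 - 1.0 / (2 * n_pairs)


def sufficient_gap(alpha, n_pairs, temperature=310.0):
    """Assumed threshold: g = -RT ln((alpha**(-1/(2N)) - 1) / (2N - 1))."""
    two_n = 2 * n_pairs
    # expm1(-ln(alpha) / 2N) equals alpha**(-1/(2N)) - 1 without cancellation error.
    ratio = math.expm1(-math.log(alpha) / two_n) / (two_n - 1)
    return -R_KCAL * temperature * math.log(ratio)


if __name__ == "__main__":
    for n in (64, 1000):
        print(n, sufficient_gap(fidelity_for(n), n))
```

With $\alpha_N = 1 - \frac{1}{2N}$ the argument of the logarithm shrinks roughly like $(2N)^{-3}$, so under this reading the required gap grows only logarithmically in the code size.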


Example 5. In Figure 3, using the left y-axis, the blue and yellow graphs plot the points $(t_1, \log_{10} N_1)$
and $(t_2, \log_{10} N_2)$, respectively, that minimize the strand length $t_i$ for given $N_i$ subject to the constraint that
$\mathrm{Bad}_{L_i}(N_i, t_i, g_{N_i}, c_i) \leq \frac{1}{2}$ as given in (6.15). The right y-axis gives the corresponding points $(t_i, g_{N_i})$. Thus, $t_i$ is
the theoretically computed sufficient length of DNA strands such that there is a $\mathrm{DNA}_{f_{G,T}}(t_i, L_i, g_{N_i})$ code
that self-assembles with fidelity $\alpha_{N_i}$ as defined in Definition 10. Note that $t_i$ is the sufficient length of DNA
strands such that there is a $\mathrm{DNA}_{f_{G,T}}(t_i, L_i, g_{N_i})$ code that self-assembles with an expected failure rate of
one WC duplex failing to form out of all $2N_i$ possible WC duplexes in the code (of strand multiplicity two).
FIG. 3. Theoretical and empirical coding bound.


To take a specific instance, if $N_1 = 1000$, then $\alpha_N = 0.9995$, so $g_N = 14.05$. Thus, a sufficient strand
length is $t_1 = 33$, because

$$\mathrm{Bad}_{L_1}(1000, 33, 14.05, 1.29) = 0.42$$

while

$$\mathrm{Bad}_{L_1}(1000, 32, 14.05, 1.29) = 1.02.$$

However, $t_2 = 54$, because

$$\mathrm{Bad}_{L_2}(1000, 54, 14.05, 0.60) = 0.23$$

while

$$\mathrm{Bad}_{L_2}(1000, 53, 14.05, 0.60) = 0.64.$$

This difference depends on the difference between $L_1$ and $L_2$ and is discussed in the Conclusion section
below.
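The search for a sufficient strand length in this example amounts to scanning $n$ upward until the Bad bound drops to 1/2 or below. The sketch below illustrates that scan only; `min_sufficient_length` is a hypothetical helper, and instead of evaluating $F_1$ and $F_2$ (which are defined elsewhere in the paper) it looks up the four Bad values quoted above.

```python
def min_sufficient_length(bad, n_min, n_max, threshold=0.5):
    """Return the smallest n in [n_min, n_max] with bad(n) <= threshold, else None."""
    for n in range(n_min, n_max + 1):
        if bad(n) <= threshold:
            return n
    return None


# Bad values quoted in Example 5 for N = 1000 and g = 14.05.
quoted_bad_first_case = {32: 1.02, 33: 0.42}   # values given for t_1
quoted_bad_second_case = {53: 0.64, 54: 0.23}  # values given for t_2

t1 = min_sufficient_length(quoted_bad_first_case.get, 32, 33)
t2 = min_sufficient_length(quoted_bad_second_case.get, 53, 54)

if __name__ == "__main__":
    print(t1, t2)
```

In practice `bad` would be a function that evaluates (6.15) for fixed $N$, $g$ and $c$; the dictionary lookups here only stand in for it.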

Example 6. Given $f_{G,T}$ and $L_i$, $\alpha_N$ and $g_N$ as in Example 5, the pink and light blue graphs in
Figure 3 respectively plot the points $(e_1, \log_{10} N_1)$ and $(e_2, \log_{10} N_2)$ that minimize strand length $e_i$ for
given $N_i$ subject to the constraint that

$$p_{i,g}(x, x) + (2N - 1)\,p_{i,g}(x, y) \leq \frac{1}{2}$$

as given in (6.11) and where

$$p_{i,g}(x, x) + (2N - 1)\,p_{i,g}(x, y)$$

was empirically estimated by performing $10^6$ trials sampling from $\mathrm{DNA}(e_i, L_i)$. Thus $e_i$ empirically
estimates the sufficient length of DNA strands such that there is a $\mathrm{DNA}_{f_{G,T}}(e_i, L_i, g_N)$ code with $N$ pairs that
self-assembles with fidelity $\alpha_N$ as defined in Definition 10. To take a specific instance, if $N_1 = 1000$,
then the empirically estimated sufficient strand length is $e_1 = 21$ for strands that don't contain 2-strings
in $\{AA, TT, AT, TA\}$. Note that for $N_2 = 1000$, the empirically estimated sufficient strand length is
$e_2 = 28$ for strands with no restrictions on their 2-strings. In general, the graphs indicate that codes taken
from $\mathrm{DNA}(n, L_1)$ have slightly better empirically estimated random coding bounds than those taken from
$\mathrm{DNA}(n, L_2)$.
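The empirical estimate described in this example can be organized as a simple Monte Carlo loop. The sketch below is an illustration under stated assumptions, not the authors' procedure: strands are drawn by rejection sampling, and `toy_gap` is a placeholder mismatch count that stands in for the stacked pair free energy gap, which is defined elsewhere in the paper. All function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x):
    return "".join(COMPLEMENT[b] for b in reversed(x))


def sample_strand(n, forbidden, rng):
    """Rejection-sample a uniform strand of length n with no forbidden 2-string."""
    while True:
        x = "".join(rng.choice("ACGT") for _ in range(n))
        if all(x[i:i + 2] not in forbidden for i in range(n - 1)):
            return x


def toy_gap(x, y):
    """Placeholder gap (NOT the paper's gamma): mismatches between y and the
    reverse complement of x, so a perfect WC partner scores 0."""
    return sum(a != b for a, b in zip(reverse_complement(x), y))


def estimate_bad_bound(n, forbidden, g, n_pairs, trials, gap=toy_gap, seed=0):
    """Estimate p_g(x, x) + (2N - 1) p_g(x, y) from random strand pairs."""
    rng = random.Random(seed)
    self_hits = 0
    cross_hits = 0
    for _ in range(trials):
        x = sample_strand(n, forbidden, rng)
        y = sample_strand(n, forbidden, rng)
        while y in (x, reverse_complement(x)):  # y must come from a different pair
            y = sample_strand(n, forbidden, rng)
        self_hits += gap(x, x) < g
        cross_hits += gap(x, y) < g or gap(x, reverse_complement(y)) < g
    return self_hits / trials + (2 * n_pairs - 1) * (cross_hits / trials)


if __name__ == "__main__":
    L1 = {"AA", "TT", "AT", "TA"}
    print(estimate_bad_bound(12, L1, g=6, n_pairs=100, trials=2000))
```

To reproduce the kind of estimate described above, one would pass the actual gap function as `gap` and raise `trials` toward $10^6$; rejection sampling becomes slow for long strands, so a constructive sampler may be preferable there.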

Example 7. Given $f_{G,T}$ and $L_1$, $\alpha_N$ and $g_N$ as in Example 5, the maroon graph in Figure 3 plots the
points $(s, \log_{10} N)$ that minimize strand length $s$ for given $N$ subject to the constraint that the DNA code
software SynDCode (Bishop et al., 2006) has produced a $\mathrm{DNA}_{f_{G,T}}(s, L_1, g_N)$ code.


8.  CONCLUSION 

A new type of DNA code, called a $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ free energy gap code, has been defined and a
statistical thermodynamic probabilistic model for $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ code self-assembly has been given.
Theoretical and empirical random coding lower bounds for $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ codes were obtained and
used to determine sufficient DNA code design parameters needed to achieve experimental goals. It has
been noted that small changes in sequence composition, e.g., whether the consecutive pairs AA, TT,
AT or TA appear in any code sequence, make a difference in the potential fidelity of a DNA code, but
perhaps not as wide a difference as Example 5 shows in the theoretical sufficient strand length bounds between the cases
$L_1 = \{AA, TT, AT, TA\}$ and $L_2 = \emptyset$. By observing the empirical results
in Examples 6 and 7, it is clear that the theoretical bound for the case $L_2 = \emptyset$, exhibited in Figure 3 by
the yellow graph, is poor in comparison to that given for the case $L_1 = \{AA, TT, AT, TA\}$, exhibited in
Figure 3 by the blue graph. This seems to be because, in the $L_2 = \emptyset$ case, the $c = 0.60$ for Theorem
1 is much further from the average value 1.42 of $f_{G,T}$ over the stacked pairs not in $L_2 = \emptyset$ than is the
$c = 1.29$ in the $L_1 = \{AA, TT, AT, TA\}$ case from the average value 1.59 of $f_{G,T}$ over the stacked pairs not in
$L_1 = \{AA, TT, AT, TA\}$.

Earlier work on free energy gap collections of oligos is summarized in Tulpan et al. (2005). The results
presented here suggest that the free energy gaps for collections that are given there may be too small. As
discussed in Example 5, the fidelity $\alpha$ of a code with $N$ pairs should be greater than $1 - \frac{1}{2N}$ if that code
is to self-assemble with an expected failure rate of less than one WC duplex. Thus, if $N > 64$, it follows
from (5.6) that the free energy gap must be greater than 8.95. However, none of the 19 collections given
in Table 2 of Tulpan et al. (2005) with $N > 64$ have a free energy gap greater than 8.95. Moreover, the
SynDCode data exhibited in Figure 3 indicate that for strands of length 16, codes larger than those listed
in Tulpan et al. (2005) may be found. Other sequence constraints for collections of oligos are considered
in Tulpan et al. (2005). SynDCode (Bishop et al., 2006) also allows many of these
same constraints to be considered. For example, the code S8-2 given in Tulpan et al. (2005) is nearly a $\mathrm{DNA}_{f_{G,T}}(16, L, 7.85)$
code where $L = \{GGG, CCC\}$. It is not exactly such a code for several reasons. The two most significant
reasons are as follows:

1. S8-2 doesn't constrain what is called the word-word crosshybridization potential (because an
underlying assumption in Tulpan et al. (2005) is that the strands are fixed to a surface).

2. S8-2 uses Pairfold (Andronescu et al., 2003) to measure the free energy gap while $\mathrm{DNA}_{f_{G,T}}(16, L, 7.85)$
uses $\gamma_{f_{G,T}}$.

As was noted earlier, $\gamma_{f_{G,T}}$ is more restrictive than Pairfold (see Example 3). Some of the most
important additional constraints for S8-2 are as follows.

For $x$ in S8-2:

a. CC is not contained at the start or end of $x$ and therefore GG is not contained at the start or end
of $\bar{x}$.

b. G is not contained in $x$ and therefore C is not contained in $\bar{x}$.

c. $15.45 < \|x, x\|_{f_{G,T}} - IP < 16.42$, where we are assuming $IP = 1.96$.

SynDCode can generate $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ codes with each of these additional constraints. In particular, in
Figure 4, an example of a $\mathrm{DNA}_{f_{G,T}}(16, \{GGG, CCC\}, 7.85)$ code is provided that satisfies conditions a-c.
Code Strands

S1   TCCTAAACCATTCTTA   C1   TAAGAATGGTTTAGGA   S50  CATCAAAAAACTCAAT   C50  ATTGAGTTTTTTGATG
S2   ACACATACTACTTCAA   C2   TTGAAGTAGTATGTGT   S51  AAAACACTTCTTTACT   C51  AGTAAAGAAGTGTTTT
S3   TTTCATTTCCAAAAAC   C3   GTTTTTGGAAATGAAA   S52  TTTTTACCTTAACCTC   C52  GAGGTTAAGGTAAAAA
S4   TAATAAACACACTCCA   C4   TGGAGTGTGTTTATTA   S53  CAATCTCTCTTCATAC   C53  GTATGAAGAGAGATTG
S5   CACTCTCACATTAAAA   C5   TTTTAATGTGAGAGTG   S54  ACACTAAAAAAACAAC   C54  GTTGTTTTTTTAGTGT
S6   ACATCTTCCTATACAA   C6   TTGTATAGGAAGATGT   S55  ATCATAACACAAACTT   C55  AAGTTTGTGTTATGAT
S7   ATCCATCAAACATAAA   C7   TTTATGTTTGATGGAT   S56  CACACATCATTTTTTC   C56  GAAAAAATGATGTGTG
S8   TTTCTCAATCCTACTA   C8   TAGTAGGATTGAGAAA   S57  CTATCCACCTAAAAAA   C57  TTTTTTAGGTGGATAG
S9   TTATACACCTCACTTT   C9   AAAGTGAGGTGTATAA   S58  TATATCCTCTCCAATC   C58  GATTGGAGAGGATATA
S10  AAAATTCATCAACCAA   C10  TTGGTTGATGAATTTT   S59  CTACATATCTAACCAC   C59  GTGGTTAGATATGTAG
S11  CAACCACTTAATCTTC   C11  GAAGATTAAGTGGTTG   S60  CTTACTACCAAACTTC   C60  GAAGTTTGGTAGTAAG
S12  ACCTTTTCTAACTCAT   C12  ATGAGTTAGAAAAGGT   S61  AAAAAATCCACATTTC   C61  GAAATGTGGATTTTTT
S13  ACTTATAAACCTACCA   C13  TGGTAGGTTTATAAGT   S62  AATTCTCCACTTTAAC   C62  GTTAAAGTGGAGAATT
S14  ACTCACCTCAATATAC   C14  GTATATTGAGGTGAGT   S63  AATTTCAACTATCACA   C63  TGTGATAGTTGAAATT
S15  TCCATATTCATACCTC   C15  GAGGTATGAATATGGA   S64  CAAAAAAAATCTCCTC   C64  GAGGAGATTTTTTTTG
S16  CACAATTTTTACAACT   C16  AGTTGTAAAAATTGTG   S65  TCACTTTTACCATTAC   C65  GTAATGGTAAAAGTGA
S17  ACTCTTTCTTTTCCTT   C17  AAGGAAAAGAAAGAGT   S66  TACTCCAACATTTTAC   C66  GTAAAATGTTGGAGTA
S18  TCTCATTAACATACAC   C18  GTGTATGTTAATGAGA   S67  AAAACACCACAAATAA   C67  TTATTTGTGGTGTTTT
S19  CATTCTTACTCTCTCT   C19  AGAGAGAGTAAGAATG   S68  TAACTTTCTATCTCCA   C68  TGGAGATAGAAAGTTA
S20  CTCTCTCAACTTAATC   C20  GATTAAGTTGAGAGAG   S69  CTCCTCTTCTCTATTA   C69  TAATAGAGAAGAGGAG
S21  AAACTAACATCAACAT   C21  ATGTTGATGTTAGTTT   S70  ACCTAACAATACAATC   C70  GATTGTATTGTTAGGT
S22  ATTCAAATCTTCCATC   C22  GATGGAAGATTTGAAT   S71  TCCTAAAATATCACAC   C71  GTGTGATATTTTAGGA
S23  CTCTAACTACTACCTT   C23  AAGGTAGTAGTTAGAG   S72  ACCAAAATACTCTTAC   C72  GTAAGAGTATTTTGGT
S24  CTTCATCATTTCTCTT   C24  AAGAGAAATGATGAAG   S73  ACAACTTCTCAATTTT   C73  AAAATTGAGAAGTTGT
S25  ATTAACTCCTCCTAAA   C25  TTTAGGAGGAGTTAAT   S74  AACATTACCTAATCAC   C74  GTGATTAGGTAATGTT
S26  CAACATACAATTACCA   C26  TGGTAATTGTATGTTG   S75  ATAACTCATCTTTCAC   C75  GTGAAAGATGAGTTAT
S27  CTCTTACAAAACATCT   C27  AGATGTTTTGTAAGAG   S76  TACACAACATATCCTA   C76  TAGGATATGTTGTGTA
S28  AACCTCTAAAATCCTA   C28  TAGGATTTTAGAGGTT   S77  CATCCACTTTTTTTTT   C77  AAAAAAAAAGTGGATG
S29  CAAAAACCAACTTTAA   C29  TTAAAGTTGGTTTTTG   S78  CTTCTACATCTCAAAA   C78  TTTTGAGATGTAGAAG
S30  TCTCTAATCACCAATA   C30  TATTGGTGATTAGAGA   S79  CTAAATACCATCCAAC   C79  GTTGGATGGTATTTAG
S31  ATATTACCACATCCTT   C31  AAGGATGTGGTAATAT   S80  ACTATCCAATTAACAC   C80  GTGTTAATTGGATAGT
S32  CATTTCACTCAAATTT   C32  AAATTTGAGTGAAATG   S81  TTTTTTATCTCACACA   C81  TGTGTGAGATAAAAAA
S33  CTTTACTTCCACAATA   C33  TATTGTGGAAGTAAAG   S82  CTACTACACTACACAT   C82  ATGTGTAGTGTAGTAG
S34  CACATATCCAAATCTC   C34  GAGATTTGGATATGTG   S83  ATTTTCACATTCTACA   C83  TGTAGAATGTGAAAAT
S35  CACCATAAATCATCAT   C35  ATGATGATTTATGGTG   S84  CAATATTCTTACACCA   C84  TGGTGTAAGAATATTG
S36  TTCACAAACCTTAATT   C36  AATTAAGGTTTGTGAA   S85  TCATACTTTATCCTCA   C85  TGAGGATAAAGTATGA
S37  ACCTCTTTATTCAAAC   C37  GTTTGAATAAAGAGGT   S86  TTTCCAATACATCAAT   C86  ATTGATGTATTGGAAA
S38  TCCTTACACAAATAAC   C38  GTTATTTGTGTAAGGA   S87  AAATCAAATACCTCTC   C87  GAGAGGTATTTGATTT
S39  ACAAATCACTTAAACA   C39  TGTTTAAGTGATTTGT   S88  TCCATTCCTTCTAATA   C88  TATTAGAAGGAATGGA
S40  CTCCTTCAATTTTCAA   C40  TTGAAAATTGAAGGAG   S89  TCCTTATTACTCCTAC   C89  GTAGGAGTAATAAGGA
S41  ATCAACACTATTCATC   C41  GATGAATAGTGTTGAT   S90  CTTCTCTTATTTCACA   C90  TGTGAAATAAGAGAAG
S42  TTCTTTCCTCATTTTT   C42  AAAAATGAGGAAAGAA   S91  TACTTCCTTTTTCTTC   C91  GAAGAAAAAGGAAGTA
S43  TATCACTCATACATCT   C43  AGATGTATGAGTGATA   S92  ACCACACACTATATAT   C92  ATATATAGTGTGTGGT
S44  CAACCATATCAAAAAC   C44  GTTTTTGATATGGTTG   S93  TACTTCACTAATTCCT   C93  AGGAATTAGTGAAGTA
S45  AATATCCTTTCTCACT   C45  AGTGAGAAAGGATATT   S94  CTTCAACAACTAACTA   C94  TAGTTAGTTGTTGAAG
S46  AAACTATTTCCACTCT   C46  AGAGTGGAAATAGTTT   S95  ACACCTATTATTCTCT   C95  AGAGAATAATAGGTGT
S47  TTTTTCATTCACCTTT   C47  AAAGGTGAATGAAAAA   S96  ACAATCAATTCTACTC   C96  GAGTAGAATTGATTGT
S48  ATAACCAACCTACATT   C48  AATGTAGGTTGGTTAT   S97  CTATCAAACTCCTTTT   C97  AAAAGGAGTTTGATAG
S49  ACCATTTTTATTCCAT   C49  ATGGAATAAAAATGGT

FIG. 4. $\mathrm{DNA}_{f_{G,T}}(16, \{GGG, CCC\}, 7.85)$ code with additional constraints.
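The purely sequence-level constraints on the code in Figure 4 can be checked mechanically. The sketch below is a minimal illustration with a hypothetical helper name: it tests that each C strand is the reverse complement of its S strand, that both have length 16 and avoid GGG and CCC, and that conditions a and b hold. It does not check condition c or the free energy gap of 7.85, since both require the stacked pair free energy values. The three pairs are copied from Figure 4.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x):
    return "".join(COMPLEMENT[b] for b in reversed(x))


def satisfies_sequence_constraints(s, c, length=16, forbidden=("GGG", "CCC")):
    """Check the sequence-only constraints for one pair (s, c)."""
    strands = (s, c)
    return (
        c == reverse_complement(s)                            # WC pair
        and all(len(t) == length for t in strands)            # strand length
        and not any(f in t for f in forbidden for t in strands)
        and not s.startswith("CC") and not s.endswith("CC")   # condition a
        and "G" not in s                                      # condition b
    )


# Pairs (S1, C1), (S2, C2) and (S50, C50) as listed in Figure 4.
FIGURE_4_SAMPLE = [
    ("TCCTAAACCATTCTTA", "TAAGAATGGTTTAGGA"),
    ("ACACATACTACTTCAA", "TTGAAGTAGTATGTGT"),
    ("CATCAAAAAACTCAAT", "ATTGAGTTTTTTGATG"),
]

if __name__ == "__main__":
    for s, c in FIGURE_4_SAMPLE:
        print(s, c, satisfies_sequence_constraints(s, c))
```

Since condition b excludes G from every S strand, the reverse complement automatically contains no C, so the C strands need no separate check for that condition.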


Thus, considering that $\mathrm{DNA}_{f_{G,T}}(n, L, g)$ codes do not ignore word-word crosshybridization potential and
use a more restrictive measure of free energy gap, the code given in Figure 4 is much more restrictive than
S8-2. The number of strands in S8-2 is 80 while the number of pairs of strands in the exhibited
$\mathrm{DNA}_{f_{G,T}}(16, \{GGG, CCC\}, 7.85)$ code is $N = 97$. It should be noted that there are other bonding specificity
constraints considered in S8-2 that are not considered in the code given in Figure 4 and, in light of the
random coding bound data $t_1$ and $e_1$ in Figure 3, S8-2 is a good design.